Abstract

The monolithic approach to policy representation in Markov Decision Processes (MDPs) looks for
a single policy that can be represented as a function from states to actions. For the monolithic
approach to succeed (and this is not always possible), a complex feature representation is often
necessary, since the policy is a complex object that has to prescribe what actions to take all over the
state space. This is especially true in large domains with complicated dynamics. It is also
computationally inefficient to both learn and plan in MDPs using a complex monolithic approach. We present a
different approach where we restrict the policy space to policies that can be represented as
combinations of simpler, parametrized skills (a type of temporally extended action) with a simple policy
representation. We introduce Learning Skills via Bootstrapping (LSB), which can use a broad family
of Reinforcement Learning (RL) algorithms as a “black box” to iteratively learn parametrized skills.
Initially, the learned skills are short-sighted, but each iteration of the algorithm allows the skills to
bootstrap off one another, improving each skill in the process. We prove that this bootstrapping
process returns a near-optimal policy. Furthermore, our experiments demonstrate that LSB can solve
MDPs that, given the same representational power, could not be solved by a monolithic approach.
Thus, planning with learned skills results in better policies without requiring complex policy
representations.


1 Introduction 


State-of-the-art Reinforcement Learning (RL) algorithms need to produce compact solutions to large or continuous 
state Markov Decision Processes (MDPs), where a solution, called a policy, generates an action when presented with 
the current state. One such approach to producing compact solutions is linear function approximation. 


MDPs are important for both planning and learning in RL. The RL planning problem uses
an MDP model to derive a policy that maximizes the sum of rewards received, while the RL learning problem learns
an MDP model from experience (because the MDP model is unknown in advance). In this paper, we focus on RL
planning, and use insights from RL that could be used to scale up to problems that are unsolvable with traditional
planning approaches such as Value Iteration and Policy Iteration (cf. Puterman [1994]). A general result from
machine learning is that the sample complexity of learning increases with the complexity of the representation (Vapnik
[1998]). In a planning scenario, increased sample complexity directly translates to an increase in computational
complexity. Thus monolithic approaches, which learn a single parametric policy that solves the entire MDP, scale poorly.
This is because they often require highly complex feature representations, especially in high-dimensional domains
with complicated dynamics, to support near-optimal policies. Instead, we investigate learning a collection of policies
over a much simpler feature representation (compact policies) and combining those policies hierarchically.


Generalization, the ability of a system to perform accurately on unseen data, is important for machine learning in
general, and can be achieved in this context by restricting the policy space, resulting in compact policies Bertsekas
[1995], Sutton [1996]. Policy Search (PS) algorithms, a form of generalization, learn and maintain a compact policy
representation so that the policy generates similar actions in nearby states Peters and Schaal [2008], Bhatnagar et al.
[2009].


Temporally Extended Actions (TEAs, Sutton et al. [1999]): Compact policies can be represented and combined
hierarchically as TEAs. TEAs are control structures that execute for multiple timesteps. They have been extensively
studied under different names, including skills Konidaris and Barto [2009], macro-actions Hauskrecht et al. [1998],
He et al. [2011], and options Sutton et al. [1999]. TEAs are known to speed up the convergence rate of MDP planning
algorithms Sutton et al. [1999], Mann and Mannor [2014]. However, the effectiveness of planning with TEAs depends
critically on the given actions. For example, Figure 1a depicts an episodic MDP with a single goal region and skills
$\{\sigma_1, \sigma_2, \ldots\}$. In this 2D setting, each skill represents a simple movement in a single, linear direction. Most of
the TEAs move towards the goal region, but $\sigma_5$ moves in the opposite direction of the goal, making it impossible to
reach. With these TEAs, we cannot hope to derive a satisfactory solution. On the other hand, if one of the TEAs takes
the agent directly to the goal (Figure 1b, the monolithic approach), then planning becomes trivial. Notice, however,
that this TEA may be quite complex, and therefore difficult to learn since, in this 2D setting, it represents non-linear
movements in multiple directions.



Figure 2: A partitioning of a target MDP in the pinball domain. Each sub-partition (partition class) $i$ represents the
skill MDP $M'_i$.


Figure 1: TEAs in an episodic MDP with S-shaped state-space and goal region G. (a) Although most actions move
toward the goal, $\sigma_5$ moves away from the goal, making it impossible to complete the task. (b) Planning becomes
trivial when a single TEA takes the agent directly to G.

Learning a useful set of TEAs has been a topic of intense research McGovern and Barto [2001], Moerman [2009],
Konidaris and Barto [2009], Brunskill and Li [2014], Hauskrecht et al. [1998]. However, prior work suffers from
the following drawbacks: (1) lack of theoretical analysis guaranteeing that the derived policy will be near-optimal
in continuous state MDPs, (2) the process of learning TEAs is so expensive that it needs to be amortized over
a sequence of MDPs, (3) the approach is not applicable to MDPs with large or continuous state-spaces, or (4) the
learned TEAs do not generalize over the state-space. We provide the first theoretical guarantees for iteratively learning
a set of simple, generalizable parametric TEAs (skills) in a continuous state MDP. The learned TEAs solve the given
tasks in a near-optimal manner.

Skills: Generalization & Temporal Abstraction: Skills are TEAs defined over a parametrized policy. Thus, they
incorporate both temporal abstraction and generalization. As TEAs, skills are closely related to options Sutton et al.
[1999] developed in the RL literature. In fact, skills, as defined here, are a special case of options. Therefore, skills
inherit many of the useful theoretical properties of options (e.g., Precup et al. [1998]). The main difference between
skills and more general options is that skills are based on parametric policies that can be initialized and reused in any
region of the state space.


We introduce a novel meta-algorithm, Learning Skills via Bootstrapping (LSB), that uses an RL algorithm as a “black
box” to iteratively learn parametrized skills. The learning algorithm is given a partition of the state-space, and one
skill is created for each class in the partition. This is a very weak requirement since any partition could be used, such
as a grid. During an iteration, an RL algorithm is used to update each skill. The skills may be initialized arbitrarily, but
after the first iteration, skills with access to a goal region or non-zero rewards will learn how to exploit those rewards
(e.g., Figure 2, Iteration 1). On further iterations, the newly acquired skills propagate reward back to other regions
of the state-space. Thus, skills that previously had no reward signal bootstrap off of the rewards of other skills (e.g.,
Figure 2, Iterations 2 and 5). Although each skill is only learned over a single partition class, it can be initialized in
any state.


It is important to note that this paper deals primarily with learning TEAs, or skills, that both speed up the
convergence rate of RL planning algorithms Sutton et al. [1999], Mann and Mannor [2014] and enable larger
problems to be solved using skills with simple policy representations. Utilizing simple policy representations is
advantageous since this results in better generalization and better sample efficiency. These skills represent a misspecified
model of the problem since they are not known in advance. By learning skills, we are therefore also inherently tackling
the learning problem, as we are iteratively correcting a misspecified model.

Contributions: Our main contributions are (1) The introduction of Learning Skills via Bootstrapping (LSB), which
requires no additional prior knowledge apart from a partition over the state-space. (2) LSB is the first algorithm
for learning skills in continuous state-spaces with theoretical convergence guarantees. (3) Theorem 1, which relates
the quality of the policy returned by LSB to the quality of the skills learned by the “black box” RL algorithm. (4)
Experiments demonstrating that LSB can solve MDPs that, given the same representational power, cannot be solved
by a policy derived from a monolithic approach. Thus, planning with learned skills allows us to work with simpler
representations Barto et al. [2013], which ultimately allows us to solve larger MDPs.


2 Background 

Let $M = \langle S, A, P, R, \gamma \rangle$ be an MDP, where $S$ is a (possibly infinite) set of states, $A$ is a finite set of actions, $P$ is a
mapping from state-action pairs to probability distributions over next states, $R$ maps each state-action pair to a reward
in $[0,1]$, and $\gamma \in [0,1)$ is the discount factor. While assuming the rewards are in $[0,1]$ may seem restrictive, any
bounded reward range can be rescaled so that this assumption holds. A policy $\pi(a|s)$ gives the probability of executing action
$a \in A$ from state $s \in S$.

Let $M$ be an MDP. The value function of a policy $\pi$ with respect to a state $s \in S$ is
$V_M^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \mid s_0 = s\right]$, where the expectation is taken with respect to the
trajectory produced by following policy $\pi$. The value function of a policy $\pi$ can also be written recursively as

$$V_M^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[R(s,a)\right] + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,\pi)}\left[V_M^{\pi}(s')\right] , \tag{1}$$

which is known as the Bellman equation. The optimal Bellman equation can be written as
$V_M^{*}(s) = \max_a \left( \mathbb{E}\left[R(s,a)\right] + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[V_M^{*}(s')\right] \right)$. Let $\varepsilon > 0$. We say that a policy $\pi$ is
$\varepsilon$-optimal if $V_M^{\pi}(s) \ge V_M^{*}(s) - \varepsilon$ for all $s \in S$. The action-value function of a policy $\pi$ can be defined by
$Q_M^{\pi}(s,a) = \mathbb{E}\left[R(s,a)\right] + \gamma\, \mathbb{E}_{s' \sim P(\cdot|s,a)}\left[V_M^{\pi}(s')\right]$, for a state $s \in S$ and an action $a \in A$, and the optimal
action-value function is denoted by $Q_M^{*}(s,a)$. Throughout this paper, we will drop the dependence on $M$ when it is
clear from context.
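To make the recursion in (1) concrete, the following is a minimal sketch of iterative policy evaluation on a two-state tabular MDP. The MDP, the policy, and the names `P`, `R`, `pi` and `evaluate_policy` are illustrative assumptions and do not come from the experiments in this paper.

```python
# Iterative policy evaluation for a tiny tabular MDP (illustrative values only).
GAMMA = 0.9

# P[s][a] is a list of (next_state, probability); R[s][a] is a reward in [0, 1].
P = {
    0: {0: [(0, 1.0)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(1, 1.0)]},
}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}}

# A policy pi(a|s) that always chooses action 1 (assumed for the example).
pi = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 1.0}}


def evaluate_policy(P, R, pi, gamma, sweeps=500):
    """Repeatedly apply the Bellman equation (1) until the values settle."""
    V = {s: 0.0 for s in P}
    for _ in range(sweeps):
        V = {
            s: sum(
                pi[s][a] * (R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a]))
                for a in P[s]
            )
            for s in P
        }
    return V


V = evaluate_policy(P, R, pi, GAMMA)
```

In this example the agent collects reward 1 forever once it is in state 1, so the fixed point is $V(1) = 1/(1-\gamma) = 10$ and $V(0) = \gamma V(1) = 9$.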


3 Skills 

One of the key ideas behind skills is that they may be learned locally, but they can be used throughout the entire 
state-space. We present a new formal definition for skills and a skill policy. 

Definition 1. A skill $\sigma$ is defined by a pair $\langle \pi_\theta, \beta \rangle$, where $\pi_\theta$ is a parametric policy with parameter vector $\theta$ and
$\beta : S \to \{0,1\}$ indicates whether the skill has finished (i.e., $\beta(s) = 1$) or not (i.e., $\beta(s) = 0$) given the current state
$s \in S$.

Definition 2. Let $\Sigma$ be a set of $m \ge 1$ skills. A skill policy $\mu$ is a mapping $\mu : S \to [m]$, where $S$ is the state-space
and $[m]$ is the index set over skills.
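One possible way to represent Definitions 1 and 2 in code is sketched below. The one-dimensional state, the class boundary at 0.5, and the names `Skill` and `make_skill_policy` are assumptions made purely for illustration; indices are 0-based here, whereas $[m]$ in the text is an abstract index set.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

State = float  # assumption: a 1-D state in [0, 1], for illustration only


@dataclass
class Skill:
    theta: Sequence[float]                           # parameter vector of pi_theta
    policy: Callable[[State, Sequence[float]], int]  # pi_theta: chooses an action
    beta: Callable[[State], int]                     # 1 if the skill has finished, else 0


def make_skill_policy(boundaries: List[float]) -> Callable[[State], int]:
    """Build a skill policy mu: S -> [m] that returns a skill index, not a skill."""
    def mu(s: State) -> int:
        for i, upper in enumerate(boundaries):
            if s < upper:
                return i
        return len(boundaries)
    return mu


mu = make_skill_policy([0.5])  # two classes: [0, 0.5) and [0.5, 1]
skills = [
    # Skill 0 always picks action 1 and finishes once the state leaves [0, 0.5).
    Skill([1.0], lambda s, th: 1, lambda s: 0 if s < 0.5 else 1),
    # Skill 1 always picks action 0 and finishes once the state enters [0, 0.5).
    Skill([-1.0], lambda s, th: 0, lambda s: 1 if s < 0.5 else 0),
]
```

Because `mu` returns only an index, an entry of `skills` can be replaced by an improved skill without touching `mu`.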

A skill policy selects which skill to initialize from the current state by returning the index of one of the skills. By
defining skill policies to select an index (rather than the skill itself), we can use the same policy even as the set of skills
is adapting. Next we define a Skill MDP, which is a sub-partition of a target MDP, as shown in Figure 3.
Definition 3. Given a target MDP $M = \langle S, A, P, R, \gamma \rangle$ and value function $V_M$, a Skill MDP for partition class $\mathcal{P}_i$ is an
MDP defined by $M'_i = \langle S', A, P', R', \gamma \rangle$, where $S' = \mathcal{P}_i \cup \{s_T\}$, $s_T$ is a terminal state, and $A$ is the action set
from $M$. The transition probability function $P'(s'|s,a)$ and reward function $R'(s,a)$ are defined below.

$$P'(s'|s,a) = \begin{cases}
P(s'|s,a) & \text{if } s \in \mathcal{P}_i \wedge s' \in \mathcal{P}_i \\
\sum_{y \in S \setminus \mathcal{P}_i} P(y|s,a) & \text{if } s \in \mathcal{P}_i \wedge s' = s_T \\
1 & \text{if } s = s_T \wedge s' = s_T \\
0 & \text{if } s = s_T \wedge s' \neq s_T
\end{cases}$$

$$R'(s,a) = \begin{cases}
0 & \text{if } s = s_T \\
\sum_{s' \in \mathcal{P}_i} P(s'|s,a)\, R(s,a) & \text{if } s \neq s_T \wedge s' \neq s_T \\
\sum_{y \in S \setminus \mathcal{P}_i} \psi(s,a,y) & \text{if } s \neq s_T \wedge s' = s_T
\end{cases}$$

where $\psi(s,a,y) = P(y|s,a)\left(R(s,a) + \gamma V_M(y)\right)$, and $\gamma$ is the discount factor from $M$.


A Skill MDP $M'_i$ is an episodic MDP that terminates once the agent escapes from $\mathcal{P}_i$ and, upon terminating, receives a
reward equal to the value of the state the agent would have transitioned to in the target MDP. To learn a skill, we
therefore construct a Skill MDP and apply a planning or RL algorithm to solve it. The resulting solution is a skill.
Each Skill MDP $M'_i$ is defined within the partition class $\mathcal{P}_i$.
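For a finite target MDP, the construction of a Skill MDP can be sketched as follows. This is one reading of Definition 3 rather than a definitive implementation: it folds the two non-terminal reward cases into a single expected reward per state-action pair, and the function name `build_skill_mdp`, the label `TERMINAL` and the three-state chain are hypothetical.

```python
TERMINAL = "s_T"  # hypothetical label for the added terminal state


def build_skill_mdp(P, R, V, cls, gamma):
    """Return (P_skill, R_skill) for partition class `cls` of a tabular MDP.

    P[s][a] maps next states to probabilities, R[s][a] is a scalar reward,
    and V[s] is the current value estimate for the target MDP.
    """
    P_skill, R_skill = {}, {}
    for s in cls:
        P_skill[s], R_skill[s] = {}, {}
        for a, nxt in P[s].items():
            inside = {s2: p for s2, p in nxt.items() if s2 in cls}
            outside = {y: p for y, p in nxt.items() if y not in cls}
            row = dict(inside)
            if outside:
                # All probability mass that leaves the class goes to s_T.
                row[TERMINAL] = sum(outside.values())
            P_skill[s][a] = row
            # In-class reward plus psi(s, a, y) summed over the exit states y.
            r_in = sum(inside.values()) * R[s][a]
            r_out = sum(p * (R[s][a] + gamma * V[y]) for y, p in outside.items())
            R_skill[s][a] = r_in + r_out
    # The terminal state is absorbing and yields no reward.
    actions = {a for s in cls for a in P[s]}
    P_skill[TERMINAL] = {a: {TERMINAL: 1.0} for a in actions}
    R_skill[TERMINAL] = {a: 0.0 for a in actions}
    return P_skill, R_skill


# Three-state chain 0 -> 1 -> 2 with a single action "right" (illustrative).
P = {0: {"right": {1: 1.0}}, 1: {"right": {2: 1.0}}, 2: {"right": {2: 1.0}}}
R = {0: {"right": 0.0}, 1: {"right": 0.0}, 2: {"right": 1.0}}
V = {0: 0.0, 1: 0.0, 2: 10.0}
P1, R1 = build_skill_mdp(P, R, V, cls={0, 1}, gamma=0.9)
```

In this example, leaving the class $\{0, 1\}$ from state 1 leads to $s_T$ and pays $\gamma V_M(2) = 9$, which is how value from outside the class enters the Skill MDP.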


Given a good set of skills, planning can be significantly faster Sutton et al. [1999], Mann and Mannor [2014].
However, in many domains we may not be given a good set of skills. It is therefore necessary to learn a good set of
skills starting from the unsatisfactory one. In the next section, we introduce an algorithm for dynamically improving
skills via bootstrapping.


4 Learning Skills via Bootstrapping (LSB) Algorithm 


Algorithm 1: Learning Skills via Bootstrapping

Require: $M$ {Target MDP}, $\mathcal{P}$ {Partitioning of $S$}, $K$ {# Iterations}
 1: $m \leftarrow |\mathcal{P}|$ {# of partitions.}
 2: $\mu(s) = \arg\max_{i \in [m]} \mathbb{I}\{s \in \mathcal{P}_i\}$
 3: Initialize $\Sigma$ with $m$ skills. {1 skill per partition.}
 4: for $k = 1, 2, \ldots, K$ do {Do $K$ iterations.}
 5:   for $i = 1, 2, \ldots, m$ do {One update per skill.}
 6:     Policy Evaluation:
 7:       Evaluate $\mu$ with $\Sigma$ to obtain $V_M^{\langle \mu, \Sigma \rangle}$
 8:     Skill Update:
 9:       Construct Skill MDP $M'_i$ from $M$ & $V_M^{\langle \mu, \Sigma \rangle}$
10:       Solve $M'_i$ obtaining policy $\pi_\theta$
11:       $\sigma'_i \leftarrow \langle \pi_\theta, \beta_i \rangle$
12:       Replace $\sigma_i$ in $\Sigma$ by $\sigma'_i$
13:   end for
14: end for
15: return $\langle \mu, \Sigma \rangle$

Learning Skills via Bootstrapping (LSB, Algorithm 1) takes a target MDP $M$, a partition $\mathcal{P}$ over the state-space and
a number of iterations $K \ge 1$, and returns a pair $\langle \mu, \Sigma \rangle$ containing a skill policy $\mu$ and a set of skills $\Sigma$. The number
of skills $m = |\mathcal{P}|$ is equal to the number of classes in the partition $\mathcal{P}$ (line 1). The skill policy $\mu$ returned by LSB is
defined (line 2) by

$$\mu(s) = \arg\max_{i \in [m]} \mathbb{I}\{s \in \mathcal{P}_i\} , \tag{2}$$

where $\mathbb{I}\{\cdot\}$ is the indicator function returning 1 if its argument is true and 0 otherwise, and $\mathcal{P}_i$ denotes the $i$-th class in
the partition $\mathcal{P}$. Thus $\mu$ simply returns the index of the skill associated with the partition class containing the current
state. On line 3, LSB could either initialize $\Sigma$ with skills that we believe might be useful or initialize them arbitrarily,
depending on our level of prior knowledge.
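The loop structure of Algorithm 1 can be illustrated with a small, self-contained sketch on a six-state chain. Everything specific here is an assumption made for illustration: the chain dynamics, the three-class partition, skills that are a single state-independent action, a brute-force search standing in for the “black box” solver, and scoring a candidate skill by the sum of its values over the class. It is not the setup used in our experiments.

```python
GAMMA = 0.9
N = 6                      # states 0..5 on a line; state 5 is the rewarding goal
ACTIONS = (-1, +1)         # move left / move right
PARTITION = [range(0, 2), range(2, 4), range(4, 6)]


def step(s, a):
    """Deterministic dynamics; reward 1 only while the agent is in the goal state."""
    s2 = min(max(s + a, 0), N - 1)
    return s2, (1.0 if s == N - 1 else 0.0)


def mu(s):
    """Skill policy: index of the partition class containing s (line 2)."""
    return next(i for i, cls in enumerate(PARTITION) if s in cls)


def evaluate(skills, sweeps=300):
    """Policy evaluation of <mu, Sigma> (lines 6-7) by fixed-point iteration."""
    V = [0.0] * N
    for _ in range(sweeps):
        new_V = []
        for s in range(N):
            s2, r = step(s, skills[mu(s)])
            new_V.append(r + GAMMA * V[s2])
        V = new_V
    return V


def solve_skill_mdp(cls, V, sweeps=300):
    """Lines 9-10: choose the best skill parameter for the Skill MDP of `cls`.

    Leaving the class ends the episode and pays gamma * V[exit state] on top
    of the immediate reward, which is how skills bootstrap off one another.
    """
    best_a, best_val = None, float("-inf")
    for a in ACTIONS:                      # brute-force stand-in for an RL solver
        W = {s: 0.0 for s in cls}
        for _ in range(sweeps):
            new_W = {}
            for s in cls:
                s2, r = step(s, a)
                new_W[s] = r + GAMMA * (W[s2] if s2 in cls else V[s2])
            W = new_W
        val = sum(W.values())
        if val > best_val:
            best_a, best_val = a, val
    return best_a


def lsb(K):
    skills = [-1] * len(PARTITION)               # line 3: arbitrary (poor) initialization
    for _ in range(K):                           # line 4
        for i, cls in enumerate(PARTITION):      # line 5
            V = evaluate(skills)                 # lines 6-7
            skills[i] = solve_skill_mdp(cls, V)  # lines 8-12
    return skills


skills = lsb(K=3)
```

In this toy problem only the skill whose class contains the goal improves in the first iteration; the remaining skills switch to moving right in later iterations, once value has propagated back from the neighbouring class.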


Figure 3: A partitioning of a target MDP in the pinball domain. Each sub-partition (partition class) $i$ represents the
skill MDP $M'_i$. Note that, so long as the classes overlap one another and the goal region is within one of the classes,
near-optimal convergence is guaranteed. Therefore, the entire state-space does not have to be partitioned.
Next (lines 4-14), LSB performs $K$ iterations. In each iteration, LSB updates the skills in $\Sigma$ (lines 5-13). Remember
that the value of a skill depends on how it is combined with other skills (e.g., Figure 1a failed because a single TEA
prevented reaching the goal). If we allowed all skills to change simultaneously, the skills could not reliably bootstrap
off of each other. Therefore, LSB updates each skill individually. Multiple iterations are needed so that the skill set
can converge (Figure 2).


The process of updating a skill (lines 6-12) starts by evaluating $\mu$ with the current skill set $\Sigma$ (line 6). Any number
of policy evaluation algorithms could be used here, such as TD($\lambda$) with function approximation Sutton and Barto
[1998] or LSTD Boyan [2002], modified to be used with skills. In our experiments, we used a straightforward variant
of LSTD Sorg and Singh [2010]. Then we use the target MDP $M$ to construct a Skill MDP $M'_i$ (line 9). Next, LSB
uses a planning or RL algorithm to approximately solve the Skill MDP $M'_i$, returning a parametrized policy $\pi_\theta$ (line
10). Any planning or RL algorithm for regular MDPs could fill this role provided that it produces a parametrized
policy. In our experiments, we used a simple actor-critic policy gradient (PG) algorithm, unless otherwise stated. Then a new
skill $\sigma'_i = \langle \pi_\theta, \beta_i \rangle$ is created (line 11), where $\pi_\theta$ is the policy derived on line 10 and

$$\beta_i(s) = \begin{cases} 0 & \text{if } s \in \mathcal{P}_i \\ 1 & \text{otherwise.} \end{cases}$$

The definition of $\beta_i$ means that the skill will terminate only if it leaves the $i$-th partition class. Finally, we update the skill set $\Sigma$
by replacing the $i$-th skill with $\sigma'_i$ (line 12). It is important to note that in LSB, updating a skill is equivalent to solving
a Skill MDP.


5 Analysis of LSB 

We provide the first convergence guarantee for iteratively learning skills in a continuous state MDP using LSB (Lemma
1 and Lemma 2, proven in the supplementary material). We use these lemmas to prove Theorem 1. This theorem
enables us to analyze the quality of the policy returned by LSB. It turns out that the quality of the
policy depends critically on the quality of the skill learning algorithm. An important parameter for determining the
quality of a policy returned by LSB is the skill learning error defined below.

Definition 4. Let $\mathcal{P}$ be a partition over the target MDP’s state-space. The skill learning error is

$$\eta_{\mathcal{P}} = \max_{i \in [m]} \eta_i , \tag{3}$$

where $\eta_i$ is the smallest $\eta_i \ge 0$ such that
$V_{M'_i}^{*}(s) - V_{M'_i}^{\pi_\theta}(s) \le \eta_i$ for all $s \in \mathcal{P}_i$, and $\pi_\theta$ is the policy returned by the skill learning algorithm executed on $M'_i$.

The skill learning error quantifies the quality of the Skill MDP solutions returned by our skill learning algorithm. If we
used an exact solver to learn skills, then $\eta_{\mathcal{P}} = 0$. However, if we use an approximate solver, then $\eta_{\mathcal{P}}$ will be non-zero
and the quality will depend on the partition $\mathcal{P}$. Generally, using finer grain partitions will decrease $\eta_{\mathcal{P}}$. However,
Theorem 1 reveals that adding too many skills can also negatively impact the returned policy’s quality.

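Theorem 1. Let $\varepsilon > 0$. If we run LSB with partition $\mathcal{P}$ for $K \ge \log_\gamma(\varepsilon(1 - \gamma))$ iterations, then the algorithm returns
policy $\varphi = \langle \mu, \Sigma \rangle$ such that

$$\left\| V_M^{*} - V_M^{\varphi} \right\|_\infty \le \frac{m\, \eta_{\mathcal{P}}}{(1 - \gamma)^2} + \varepsilon , \tag{4}$$

where $m$ is the number of classes in $\mathcal{P}$.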

The proof of Theorem 1 is divided into three parts (a complete proof is given in the supplementary material). The main
challenge to proving Theorem 1 is that updating one skill can have a significant impact on the value of other skills.
Our analysis starts by bounding the impact of updating one skill. Note that $\Sigma$ represents a skill set and $\Sigma_i$ represents a
skill set where we have updated the $i$-th skill (corresponding to the $i$-th partition class $\mathcal{P}_i$) in the set. (1) First, we show
that the error between $V_M^{*}$, the globally optimal value function, and $V_M^{\langle \mu, \Sigma_i \rangle}$ is a contraction when $s \in \mathcal{P}_i$ and is bounded
by $\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \|_\infty + \frac{\eta_{\mathcal{P}}}{1 - \gamma}$ otherwise (Lemma 1). (2) Next we apply an inductive argument to show that updating
all $m$ skills results in a $\gamma$-contraction over the entire state space (Lemma 2). (3) Finally, we apply this contraction
recursively, which proves Theorem 1.

This provides the first theoretical guarantees of convergence to a near-optimal solution when iteratively learning a set
of skills $\Sigma$ in a continuous state space. Theorem 1 tells us that when the skill learning error is small, LSB returns a
near-optimal policy. The first term on the right hand side of (4) is the approximation error. This is the loss we pay for
the parametrized class of policies that we learn skills over. Since $m$ represents the number of classes defined by the
partition, we now have a formal way of analysing the effect of the partitioning structure. In addition, complex skills
do not need to be designed by a domain expert; only the partitioning needs to be provided a priori. The second term
is the convergence error. It goes to 0 as the number of iterations $K$ increases.
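As a small worked example of the iteration requirement $K \ge \log_\gamma(\varepsilon(1 - \gamma))$ in Theorem 1, the sketch below computes the smallest admissible integer $K$. The helper name `iterations_needed` is hypothetical, and the function assumes $0 < \gamma < 1$ so that the logarithm is defined.

```python
import math


def iterations_needed(epsilon: float, gamma: float) -> int:
    """Smallest integer K >= 1 with K >= log_gamma(epsilon * (1 - gamma))."""
    if not (0.0 < gamma < 1.0) or epsilon <= 0.0:
        raise ValueError("need 0 < gamma < 1 and epsilon > 0")
    # Change of base: log_gamma(x) = ln(x) / ln(gamma).
    k = math.log(epsilon * (1.0 - gamma)) / math.log(gamma)
    return max(1, math.ceil(k))


K = iterations_needed(epsilon=0.1, gamma=0.9)
```

For $\gamma = 0.9$ and $\varepsilon = 0.1$ this gives $K = 44$, since $\log_{0.9}(0.01) \approx 43.7$. A smaller $\varepsilon$ or a discount factor closer to 1 increases the required number of iterations.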


At first, the guarantee provided by Theorem 1 may appear similar to (Hauskrecht et al. [1998], Theorem 1). However,
Hauskrecht et al. [1998] derive TEAs only at the beginning of the learning process and do not update them. On the
other hand, LSB updates its skill set dynamically via bootstrapping. Thus, LSB does not require prior knowledge of
the optimal value function.


Theorem 1 does not explicitly present the effect of policy evaluation error, which occurs with any approximate policy
evaluation technique. However, if the policy evaluation error is bounded by $\nu > 0$, then we can simply replace $\eta_{\mathcal{P}}$ in
(4) with $(\eta_{\mathcal{P}} + \nu)$. Again, smaller policy evaluation error leads to smaller approximation error.


6 Experiments and Results 


We performed experiments on three well-known RL benchmarks: Mountain Car (MC), Puddle World (PW) Sutton
[1996] and the Pinball domain Konidaris and Barto [2009]. The MC domain has similar results to PW and therefore
has been moved to the supplementary material. We use two variations for the Pinball domain, namely maze-world,
which we created, and pinball-world, which is one of the standard pinball benchmark domains. Our experiments show
that, using a simple policy representation, the monolithic approach is unable to adequately solve the tasks in each
case as the policy representation is not complex enough. However, LSB can solve these tasks with the same simple
policy representation by combining bootstrapped skills. These domains are simple enough that we can still solve them
using richer representations. This allows us to compare LSB to a policy that is very close to optimal. Our experiments
demonstrate potential to scale up to higher dimensional domains by combining skills over simple representations.

Recall that LSB is a meta-algorithm. We must provide an algorithm for Policy Evaluation (PE) and skill learning. In
our experiments, for the MC and PW domains, we used SMDP-LSTD Sorg and Singh [2010] for PE and a modified
version of Regular-Gradient Actor-Critic Bhatnagar et al. [2009] for skill learning (see supplementary material for
details). In the Pinball domains, we used Nearest-Neighbor Function Approximation (NN-FA) for PE and UCB
Random Policy Search (UCB-RPS) for skill learning.


In our experiments, for the MC and PW domains, each skill is simply represented as a probability distribution over
actions (independent of the state). We compare their performance to a policy using the same representation that has
been derived using the monolithic approach. Each experiment is run for 10 independent trials. A 2 x 2 grid partitioning
is used for the skill partition in these domains, unless otherwise stated. Binary-grid features are used to estimate the
value function. In the pinball domains, each skill is represented by 5 polynomial features corresponding to each state
dimension and a bias term. A 4 x 1 x 1 x 1 grid partitioning is used for maze-world and a 4 x 3 x 1 x 1 partitioning
is used for pinball-world. The value function is represented by a KD-tree containing 1000 state-value pairs uniformly
sampled in the domain. A value for a particular state is obtained by assigning the value of the nearest neighbor to that
state that is contained within the KD-tree. Each experiment in the pinball domain has been run for 5 independent trials.
These are example representations. In principle, any value function and policy representation that is representative of
the domain can be utilized.
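A grid partitioning of this kind amounts to mapping each state to the index of the grid cell that contains it. The following sketch shows one way to do this; the function name `grid_class`, the row-major index layout, and the state bounds in the usage lines are assumptions for illustration and are not taken from our implementation.

```python
from typing import Sequence


def grid_class(state: Sequence[float], cells: Sequence[int],
               low: Sequence[float], high: Sequence[float]) -> int:
    """Index of the grid cell (partition class) that contains `state`."""
    index = 0
    for x, n, lo, hi in zip(state, cells, low, high):
        frac = (x - lo) / (hi - lo)
        # Clamp so that points on or beyond the bounds fall in a valid cell.
        cell = max(0, min(int(frac * n), n - 1))
        index = index * n + cell
    return index


# A 2 x 2 grid over the unit square (4 classes).
c_2x2 = grid_class((0.9, 0.1), cells=(2, 2), low=(0.0, 0.0), high=(1.0, 1.0))

# A 4 x 3 x 1 x 1 grid over (x, y, x-velocity, y-velocity) (12 classes);
# the velocity bounds of [-1, 1] are assumed.
c_pinball = grid_class((0.6, 0.5, 0.3, -0.2), cells=(4, 3, 1, 1),
                       low=(0.0, 0.0, -1.0, -1.0), high=(1.0, 1.0, 1.0, 1.0))
```

With a single cell along the velocity dimensions, the class index depends only on position, so a 4 x 3 x 1 x 1 grid yields 12 classes.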


6.1 Puddle World 

Puddle World is a 2-dimensional world containing two puddles. A successful agent should navigate to the goal
location, avoiding the puddles. The state space is the $(x, y)$ location of the agent. Figure 4a compares the monolithic
approach with LSB (for a 2 x 2 grid partition). The monolithic approach achieves low average reward. However, with
the same restricted policy representation, LSB combines a set of skills, resulting in a richer solution space and a higher
average reward, as seen in Figure 4a. This is comparable to the approximately optimal average reward attained by
executing Approximate Value Iteration (AVI) for a huge number of iterations. In this experiment LSB is not initiated
in the partition class containing the goal state but still achieves near-optimal convergence after only 2 iterations.

Figure 4b compares the performance of different partitions, where a 1 x 1 grid represents the monolithic approach.
The skill learning error $\eta_{\mathcal{P}}$ is significantly smaller for all the partitions greater than 1 x 1, resulting in lower cost. On
the other hand, according to Theorem 1, adding more skills $m$ increases the cost. A tradeoff therefore exists between
$\eta_{\mathcal{P}}$ and $m$. In practice, $\eta_{\mathcal{P}}$ tends to dominate $m$. In addition to the tradeoff, the importance of the partition design is
evident when analyzing the cost of the 3 x 3 and 4 x 4 grids. In this scenario, the 3 x 3 partition design is better suited
to Puddle World than the 4 x 4 partition, resulting in lower cost.

Figure 4: The Puddle World domain: (a) The average reward for the LSB algorithm generated by the LSB skill policy.
This is compared to the monolithic approach that attempts to solve the global task as well as an approximately optimal
policy derived using Q-learning (applied for a huge number of iterations). (b) The average cost (negative reward) for
each grid partition. (c) Repeatable skills plot.

Figure 5: The Pinball domains: (a) The maze-world domain. (b) The average reward for the LSB algorithm generated
by the LSB skill policy for maze-world. This is compared to the monolithic approach that attempts to solve the global
task as well as an approximately optimal policy derived using Approximate Value Iteration (AVI) executed for a huge
number of iterations. (c) The learned value function for the maze-world domain. (d) The pinball-world domain. (e)
The average reward for the LSB algorithm generated by the LSB skill policy in the pinball-world domain. In this
domain, LSB converges after a single iteration as we start LSB in the partition containing the goal. (f) The learned
value function.

6.2 Skill Generalization 

In the worst case, the number of skills learned by LSB is based on the partition. However, LSB may learn similar
skills in different partition classes, adding redundancy to the skill set. This suggests that skills can be reused in different
parts of the state-space, resulting in fewer skills than partition classes. To validate this intuition,
a 4 x 4 grid was created for both the Mountain Car and Puddle World domains. We ran LSB using this grid on each
domain. Since it is more intuitive to visualize and analyze the reusable skills generated for the 2D Puddle World, we
present these skills in a quiver plot superimposed on the Puddle World (Figure 4c). For each skill, the direction (red
arrows in Figure 4c) is determined by sampling and averaging actions from the skill’s probability distribution. As can
be seen in Figure 4c, many of the learned skills have the same direction. These skills can therefore be combined into
a single skill and reused throughout the state-space. In this case, the skill-set consisting of 16 skills can be reduced
to a reusable skill-set of 5 skills (the four cardinal directions, including two skills that are in the approximately north
direction). Therefore, skill reuse may further reduce the complexity of a solution.
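The grouping of skills by direction can be sketched as follows. This is an illustrative simplification rather than the procedure used to produce Figure 4c: it uses the exact expected displacement instead of sampling and averaging actions, it snaps each direction to the nearest of four assumed primitive moves, and the action distributions in `skill_dists` are made-up values.

```python
# Unit displacement for each primitive action (an assumed action set).
MOVES = {"N": (0.0, 1.0), "E": (1.0, 0.0), "S": (0.0, -1.0), "W": (-1.0, 0.0)}


def mean_direction(dist):
    """Expected displacement under a state-independent action distribution."""
    dx = sum(p * MOVES[a][0] for a, p in dist.items())
    dy = sum(p * MOVES[a][1] for a, p in dist.items())
    return dx, dy


def nearest_cardinal(dx, dy):
    """Label of the primitive move best aligned with the direction (dx, dy)."""
    return max(MOVES, key=lambda a: dx * MOVES[a][0] + dy * MOVES[a][1])


def group_skills(dists):
    """Map each direction label to the indices of the skills that share it."""
    groups = {}
    for i, dist in enumerate(dists):
        label = nearest_cardinal(*mean_direction(dist))
        groups.setdefault(label, []).append(i)
    return groups


skill_dists = [
    {"N": 0.7, "E": 0.1, "S": 0.1, "W": 0.1},
    {"N": 0.6, "E": 0.2, "S": 0.1, "W": 0.1},
    {"N": 0.1, "E": 0.7, "S": 0.1, "W": 0.1},
]
groups = group_skills(skill_dists)
```

Skills that land in the same group are candidates for being merged into a single reusable skill.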

6.3 Pinball 


These experiments have been performed in domains with simple dynamics. We decided to test LSB on a domain
with significantly more complicated dynamics, namely Pinball Konidaris and Barto [2009]. The goal in Pinball is
to direct an agent (the blue ball) to the goal location (the red region). The Pinball domain provides a sterner test
for LSB as the velocity at which the agent is travelling needs to be taken into account to circumnavigate obstacles.
In addition, collisions with obstacles in the environment are non-linear at obstacle vertices. The state space is the
four-tuple $(x, y, \dot{x}, \dot{y})$, where $x, y$ represents the 2D location of the agent, and $\dot{x}, \dot{y}$ represents the velocities in each
direction.


Two domains have been utilized, namely maze-world and pinball-world (Figure 5a and Figure 5d respectively). For
maze-world, a 4 x 1 x 1 x 1 grid partitioning has been utilized and therefore 4 skills need to be learned using LSB.
After running LSB on the maze-world domain, it can be seen in Figure 5b that LSB significantly outperforms the
monolithic approach. Note that each skill in LSB has the same parametric representation as the monolithic approach,
that is, a five-tuple $(1, x, y, \dot{x}, \dot{y})$. This simple parametric representation does not have the power to consistently solve
maze-world using the monolithic approach. However, using LSB this simple representation is capable of solving the
task in a near-optimal fashion, as indicated on the average reward graph (Figure 5b) and resulting value function (Figure
5c).

We also tested LSB on the more challenging pinball-world domain (Figure 5d). The same LSB parameters were used as in
maze-world, but the provided partitioning was a 4 x 3 x 1 x 1 grid. Therefore, 12 skills needed to be learned in this
domain. More skills were utilized for this domain since the domain is significantly more complicated than maze-world
and a more refined skill-set is required to solve the task. As can be seen in the average reward graph in Figure 5e, LSB
clearly outperforms the monolithic approach in this domain. It is less than optimal but still manages to sufficiently
perform the task (see value function, Figure 5f). The drop in performance is due to the complicated obstacle setup,
the non-linear dynamics when colliding with obstacle edges and the partition design.


7 Discussion 

In this paper, we introduced an iterative bootstrapping procedure for learning skills. This approach is similar to
(and partly inspired by) skill chaining Konidaris and Barto [2009]. However, the heuristic approach applied by skill
chaining may not produce a near-optimal policy even when the skill learning error is small. We provide theoretical
results for LSB that directly relate the quality of the final policy to the skill learning error. LSB is the first algorithm
that provides theoretical convergence guarantees whilst iteratively learning a set of skills in a continuous state space.
In addition, the theoretical guarantees for LSB enable us to interlace skill learning with Policy Evaluation (PE). We
can therefore perform PE whilst learning skills and still converge to a near-optimal solution.


In each of the experiments, LSB converges in very few iterations. This is because we perform policy evaluation 
in between each skill update, causing the global value function to converge at a fast pace. Initializing LSB in the 
partition class containing the goal state also results in value being propagated quickly to subsequent partition classes 
and therefore fast convergence. However, LSB can be initialized from any partition class. 


One limitation of LSB is that it learns skills for all partition classes. This is a problem in high-dimensional
state-spaces. However, the problem can be overcome by focusing only on the most important regions of the state-space.
One way to identify these regions is by observing an expert’s demonstrations Abbeel and Ng [2005], Argall et al.
[2009]. In addition, we could apply self-organizing approaches to facilitate skill reuse Moerman [2009]. Skill reuse
can be especially useful for transfer learning. Consider a multi-agent environment Garant et al. [2015] where many
of the agents may be performing similar tasks which require a similar skill-set. In this environment, skill reuse can
facilitate learning complex multi-agent policies (co-learning) with very few samples.

Given a task, LSB can learn and combine skills, based on a set of rules, to solve the task. This structure of learned 
skills and combination rules forms a generative action grammar Summers-Stay et al. [2012] which paves the way for 
building advanced skill structures that are capable of solving complex tasks in different environments and conditions. 

One exciting extension of our work would be to incorporate skill interruption, similar to option interruption. Option 
interruption involves terminating an option based on an adaptive interruption rule Sutton et al. [1999]. Options are 
terminated when the value of continuing the current option is lower than the value of switching to a new option. 
This also implies that partition classes can overlap one another, as the option interruption rule ensures that the option 
with the best long-term value is always being executed. Mankowitz et al. [2014] interlaced Sutton's interruption rule 
between iterations of value iteration and proved convergence to a global optimum. In addition, they take advantage of 
faster convergence rates due to temporal extension by adding a time-based regularization term, resulting in a new option 
interruption rule. However, their results have not yet been extended to function approximation. Comanici and 
Precup [2010] have developed a policy gradient technique for learning the termination conditions of options. Their 
method involves augmentation of the state-space. However, the overall solution converges to a local optimum. 
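
The interruption rule described above (terminate when continuing is worth less than switching) can be illustrated with a few lines of code. This is only a sketch under stated assumptions: the function name `interrupt_and_select` and the dictionary of per-option values at the current state are hypothetical, and how those values are estimated is left open.

```python
from typing import Dict, Hashable


def interrupt_and_select(current: Hashable,
                         option_values: Dict[Hashable, float]) -> Hashable:
    """Return the option to execute next at the current state.

    option_values maps each available option to its estimated long-term
    value at the current state (hypothetical input). The current option is
    interrupted only if some other option has a strictly higher value.
    """
    best = max(option_values, key=option_values.get)
    if option_values[current] < option_values[best]:
        return best      # interrupt: switching is worth more than continuing
    return current       # keep executing the current option
```

With overlapping partition classes, the same check would be applied at every state in which more than one skill is available.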

A Appendix 

A.1 LSB Skill MDP 

The formal definition of a Skill MDP is provided here for completeness. 

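Definition 5. Given a target MDP $M = \langle S, A, P, R, \gamma \rangle$ and value function $V_M$, a Skill MDP for partition class $\mathcal{P}_i$ is an 
MDP defined by $M'_i = \langle S', A, P', R', \gamma \rangle$, where $S' = \mathcal{P}_i \cup \{s_T\}$, $s_T$ is a terminal state, and $A$ is the action set 
from $M$. The transition probabilities are 

$$P'(s' \mid s, a) = \begin{cases} P(s' \mid s, a) & \text{if } s \in \mathcal{P}_i \wedge s' \in \mathcal{P}_i \\ \sum_{y \in S \setminus \mathcal{P}_i} P(y \mid s, a) & \text{if } s \in \mathcal{P}_i \wedge s' = s_T \\ 1 & \text{if } s = s_T \wedge s' = s_T \\ 0 & \text{if } s = s_T \wedge s' \neq s_T, \end{cases}$$

the reward function is 

$$R'(s, a, s') = \begin{cases} 0 & \text{if } s = s_T \\ R(s, a) & \text{if } s \neq s_T \wedge s' \neq s_T \\ \dfrac{1}{\sum_{y \in S \setminus \mathcal{P}_i} P(y \mid s, a)} \sum_{y \in S \setminus \mathcal{P}_i} \psi(s, a, y) & \text{if } s \neq s_T \wedge s' = s_T, \end{cases}$$

where $\psi(s, a, y) = P(y \mid s, a)\left(R(s, a) + \gamma V_M(y)\right)$ and $\gamma$ is the discount factor from $M$. 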
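
For a small tabular MDP, the Skill MDP construction can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions that are not part of the original text: the function name `build_skill_mdp` and the array layout are hypothetical, the terminal state $s_T$ is stored at the last index, and the reward on an exit transition is assumed to be the sum of $\psi(s, a, y)$ over outside states divided by the exit probability.

```python
import numpy as np


def build_skill_mdp(P, R, V, part, gamma):
    """Build (P', R') for one partition class of a tabular MDP.

    P: (S, A, S) transition tensor, R: (S, A) rewards, V: (S,) values of
    the current skill policy, part: indices of the states in the class.
    The terminal state s_T is stored at the last index of the result.
    """
    P, R, V = np.asarray(P, float), np.asarray(R, float), np.asarray(V, float)
    part = np.asarray(part)
    outside = np.setdiff1d(np.arange(P.shape[0]), part)
    n, num_actions = len(part), P.shape[1]
    acts = np.arange(num_actions)

    P_new = np.zeros((n + 1, num_actions, n + 1))
    R_new = np.zeros((n + 1, num_actions, n + 1))

    # Transitions and rewards inside the class are copied unchanged.
    P_new[:n, :, :n] = P[np.ix_(part, acts, part)]
    R_new[:n, :, :n] = R[part][:, :, None]

    # All transitions that leave the class are redirected to s_T.
    P_exit = P[np.ix_(part, acts, outside)]
    exit_prob = P_exit.sum(axis=2)
    P_new[:n, :, n] = exit_prob

    # Assumed exit reward: sum of psi(s, a, y) divided by the exit probability.
    psi_sum = (P_exit * (R[part][:, :, None] + gamma * V[outside])).sum(axis=2)
    R_new[:n, :, n] = np.divide(psi_sum, exit_prob,
                                out=np.zeros_like(psi_sum),
                                where=exit_prob > 0)

    # s_T is absorbing and yields zero reward.
    P_new[n, :, n] = 1.0
    return P_new, R_new
```

Because the values of the outside states enter only through the exit reward, a skill learned on this MDP bootstraps off the current value function of the other skills.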

A.2 Proof of Theorem 1 

In this section, we prove Theorem 1. 

We will make use of the following notation. For $m \geq 1$, we denote by $[m]$ the set $\{1, 2, \dots, m\}$. Let $\sigma = \langle \pi_\theta, \beta \rangle$ 
be a skill. Suppose the skill $\sigma$ is initialized from a state $s$. 

1. $P^{\sigma}(s' \mid s, t)$ denotes the probability that the skill will terminate (i.e., return control to the agent) in state $s'$ 
exactly $t \geq 1$ timesteps after being initialized. 

2. $\tilde{R}^{\sigma}_{s}$ denotes the expected, discounted sum of rewards received during $\sigma$'s execution. We use the $\sim$ notation 
to emphasize that this quantity is discounted. 

The proof of Theorem 1 makes use of two lemmas. The first lemma (Lemma 1) demonstrates a relationship 
between the value of a skill policy before and after replacing a single skill. Within the skill's partition class there is 
a $\gamma$-contraction (plus some error), but outside the skill's partition class the value may become worse by a bounded 
amount. The second lemma (Lemma 2) uses Lemma 1 to prove that after a complete iteration (each skill has been 
updated once), there is a $\gamma$-contraction (plus some error) over the entire state-space. We then prove Theorem 1 by 
applying the result of Lemma 2 recursively. 

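Lemma 1. Let 

1. $M$ be the target MDP, 

2. $\mathcal{P}$ a partition of the state-space, 

3. $\mu$ be the skill policy defined by $\mathcal{P}$ (i.e., $\mu(s) = \arg\max_{i \in [m]} \mathbb{I}\{s \in \mathcal{P}_i\}$), 

4. $\Sigma$ be an ordered set of $m \geq 1$ skills, and 

5. $i \in [m]$ be the index of the $i$-th skill in $\Sigma$. 

Suppose we apply $\mathcal{A}$ to the Skill MDP $M'_i$ defined by $M$ and $V_M^{\langle \mu, \Sigma \rangle}$, obtain $\pi_\theta$, construct a new skill $\sigma'_i = \langle \pi_\theta, \beta_i \rangle$, 
and create a new skill set $\Sigma' = (\Sigma \setminus \{\sigma_i\}) \cup \{\sigma'_i\}$ by replacing the $i$-th skill with the new skill. Then 

$$\forall s \in S, \quad V_M^{\langle \mu, \Sigma \rangle}(s) - V_M^{\langle \mu, \Sigma' \rangle}(s) \leq \frac{\eta}{1-\gamma}, \tag{5}$$

and 

$$V_M^{*}(s) - V_M^{\langle \mu, \Sigma' \rangle}(s) \leq \begin{cases} \gamma \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \dfrac{\eta}{1-\gamma} & \text{if } s \in \mathcal{P}_i, \text{ and} \\[2ex] \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \dfrac{\eta}{1-\gamma} & \text{otherwise,} \end{cases} \tag{6}$$

where $\eta$ is the skill learning error. 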


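Proof. 

Proving that (5) holds: 

First, we show that (5) holds. For each skill $\sigma_i \in \Sigma$, we denote the skill's policy and termination rule by $\pi_i$ and 
$\beta_i$, respectively. If $s \in \mathcal{P}_j$ where $j \neq i$, then 

$$\begin{aligned}
V_M^{\langle \mu, \Sigma \rangle}(s) - V_M^{\langle \mu, \Sigma' \rangle}(s)
&= \left( \tilde{R}^{\sigma_j}_{s} + \sum_{t=1}^{\infty} \gamma^t \sum_{s'} P^{\sigma_j}(s' \mid s, t)\, V_M^{\langle \mu, \Sigma \rangle}(s') \right) - \left( \tilde{R}^{\sigma_j}_{s} + \sum_{t=1}^{\infty} \gamma^t \sum_{s'} P^{\sigma_j}(s' \mid s, t)\, V_M^{\langle \mu, \Sigma' \rangle}(s') \right) \\
&= \sum_{t=1}^{\infty} \gamma^t \sum_{s'} P^{\sigma_j}(s' \mid s, t) \left( V_M^{\langle \mu, \Sigma \rangle}(s') - V_M^{\langle \mu, \Sigma' \rangle}(s') \right) \\
&\leq \gamma \left\| V_M^{\langle \mu, \Sigma \rangle} - V_M^{\langle \mu, \Sigma' \rangle} \right\|_\infty .
\end{aligned}$$

On the other hand, if $s \in \mathcal{P}_i$, then 

$$\begin{aligned}
V_M^{\langle \mu, \Sigma \rangle}(s) - V_M^{\langle \mu, \Sigma' \rangle}(s)
&= V_M^{\langle \mu, \Sigma \rangle}(s) + \left( V_{M'_i}^{*}(s) - V_{M'_i}^{*}(s) \right) - V_M^{\langle \mu, \Sigma' \rangle}(s) && \text{By inserting } 0 = V_{M'_i}^{*}(s) - V_{M'_i}^{*}(s). \\
&= \left( V_M^{\langle \mu, \Sigma \rangle}(s) - V_{M'_i}^{*}(s) \right) + \left( V_{M'_i}^{*}(s) - V_M^{\langle \mu, \Sigma' \rangle}(s) \right) && \text{Regrouping terms.} \\
&\leq 0 + \left( V_{M'_i}^{*}(s) - V_M^{\langle \mu, \Sigma' \rangle}(s) \right) && \text{The definition of } M'_i \Rightarrow V_{M'_i}^{*}(s) \geq V_M^{\langle \mu, \Sigma \rangle}(s). \\
&\leq \eta && \text{By Definition 4.}
\end{aligned}$$

In either case, 

$$V_M^{\langle \mu, \Sigma \rangle}(s) - V_M^{\langle \mu, \Sigma' \rangle}(s) \leq \gamma \left\| V_M^{\langle \mu, \Sigma \rangle} - V_M^{\langle \mu, \Sigma' \rangle} \right\|_\infty + \eta ,$$

which leads to (5) by recursing on this inequality. 

Proving that (6) holds: 

If $s \notin \mathcal{P}_i$, then by (5), we have 

$$V_M^{*}(s) - V_M^{\langle \mu, \Sigma' \rangle}(s) \leq V_M^{*}(s) - V_M^{\langle \mu, \Sigma \rangle}(s) + \frac{\eta}{1-\gamma} \leq \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \frac{\eta}{1-\gamma} .$$

Now we consider the case where $s \in \mathcal{P}_i$. Let $\sigma'_i = \langle \pi_\theta, \beta_i \rangle$ be the newly introduced skill. We denote by $\sigma'_i \langle \mu, \Sigma \rangle$ 
the policy that first executes $\sigma'_i$ from a state $s \in \mathcal{P}_i$ and then follows the policy $\langle \mu, \Sigma \rangle$ thereafter. For all $s \in \mathcal{P}_i$, 

$$\begin{aligned}
V_M^{*}(s) - V_M^{\langle \mu, \Sigma' \rangle}(s)
&= V_M^{*}(s) - \left( \tilde{R}^{\sigma'_i}_{s} + \sum_{t=1}^{\infty} \gamma^t \sum_{s'} P^{\sigma'_i}(s' \mid s, t)\, V_M^{\langle \mu, \Sigma' \rangle}(s') \right) \\
&\leq \left( \tilde{R}^{\sigma'_i}_{s} + \sum_{t=1}^{\infty} \gamma^t \sum_{s'} P^{\sigma'_i}(s' \mid s, t)\, V_M^{*}(s') \right) - \left( \tilde{R}^{\sigma'_i}_{s} + \sum_{t=1}^{\infty} \gamma^t \sum_{s'} P^{\sigma'_i}(s' \mid s, t)\, V_M^{\langle \mu, \Sigma' \rangle}(s') \right) + \eta \\
&= \sum_{t=1}^{\infty} \gamma^t \sum_{s'} P^{\sigma'_i}(s' \mid s, t) \left( V_M^{*}(s') - V_M^{\langle \mu, \Sigma' \rangle}(s') \right) + \eta \\
&= \sum_{t=1}^{\infty} \gamma^t \sum_{s'} P^{\sigma'_i}(s' \mid s, t) \left( \left( V_M^{*}(s') - V_M^{\langle \mu, \Sigma \rangle}(s') \right) + \left( V_M^{\langle \mu, \Sigma \rangle}(s') - V_M^{\langle \mu, \Sigma' \rangle}(s') \right) \right) + \eta \\
&\leq \gamma \left( \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \frac{\eta}{1-\gamma} \right) + \eta \\
&= \gamma \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \frac{\eta}{1-\gamma} .
\end{aligned}$$

□ 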

Lemma 2. Suppose we execute LSB for a single iteration. Let $\Sigma$ be the set of skills at the beginning of the iteration 
and $\Sigma'$ be the set of skills after each skill has been updated and the iteration has completed. Then 

$$\left\| V_M^{*} - V_M^{\langle \mu, \Sigma' \rangle} \right\|_\infty \leq \gamma \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \frac{m\eta}{1-\gamma} . \tag{7}$$

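Proof. Without loss of generality, we assume that the skills are updated in order of increasing index. We denote the 
skill set at the beginning of the iteration by $\Sigma$ and the skill set at the end of the iteration, after all of the skills 
have been updated once, by $\Sigma'$. It will be convenient to refer to the intermediate skill sets that are created during an iteration. 
Therefore, we denote by $\Sigma'_1, \Sigma'_2, \dots, \Sigma'_m = \Sigma'$ the sets of skills after the first skill was replaced, the second skill was 
replaced, ..., and after the $m$-th skill was replaced, respectively. 

We proceed by induction on the skill updates. As the base case, notice that by Lemma 1 we have that 

$$V_M^{*}(s) - V_M^{\langle \mu, \Sigma'_1 \rangle}(s) \leq \begin{cases} \gamma \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \dfrac{\eta}{1-\gamma} & \text{if } s \in \mathcal{P}_1, \text{ and} \\[2ex] \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \dfrac{\eta}{1-\gamma} & \text{otherwise.} \end{cases}$$

Let $1 \leq i < m$. Now suppose that for $\Sigma'_i$ we have 

$$V_M^{*}(s) - V_M^{\langle \mu, \Sigma'_i \rangle}(s) \leq \begin{cases} \gamma \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \dfrac{i\eta}{1-\gamma} & \text{if } s \in \bigcup_{j \in [i]} \mathcal{P}_j, \text{ and} \\[2ex] \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \dfrac{i\eta}{1-\gamma} & \text{otherwise.} \end{cases}$$
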


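Now for $\Sigma'_{i+1}$ we have several cases: 

1. $s \in \mathcal{P}_{i+1}$. 

By applying Lemma 1, we see that 

$$\begin{aligned}
V_M^{*}(s) - V_M^{\langle \mu, \Sigma'_{i+1} \rangle}(s)
&\leq \gamma \left\| V_M^{*} - V_M^{\langle \mu, \Sigma'_i \rangle} \right\|_\infty + \frac{\eta}{1-\gamma} \\
&\leq \gamma \left( \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \frac{i\eta}{1-\gamma} \right) + \frac{\eta}{1-\gamma} \\
&\leq \gamma \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \frac{(i+1)\eta}{1-\gamma} .
\end{aligned}$$

2. $s \in \bigcup_{j \in [i]} \mathcal{P}_j$. 

By applying Lemma 1, we see that 

$$\begin{aligned}
V_M^{*}(s) - V_M^{\langle \mu, \Sigma'_{i+1} \rangle}(s)
&= V_M^{*}(s) - V_M^{\langle \mu, \Sigma'_i \rangle}(s) + V_M^{\langle \mu, \Sigma'_i \rangle}(s) - V_M^{\langle \mu, \Sigma'_{i+1} \rangle}(s) \\
&\leq V_M^{*}(s) - V_M^{\langle \mu, \Sigma'_i \rangle}(s) + \frac{\eta}{1-\gamma} \\
&\leq \left( \gamma \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \frac{i\eta}{1-\gamma} \right) + \frac{\eta}{1-\gamma} \\
&= \gamma \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \frac{(i+1)\eta}{1-\gamma} .
\end{aligned}$$

3. $s \notin \bigcup_{j \in [i+1]} \mathcal{P}_j$. 

Again, by Lemma 1, we see that 

$$\begin{aligned}
V_M^{*}(s) - V_M^{\langle \mu, \Sigma'_{i+1} \rangle}(s)
&\leq V_M^{*}(s) - V_M^{\langle \mu, \Sigma'_i \rangle}(s) + \frac{\eta}{1-\gamma} \\
&\leq \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \frac{i\eta}{1-\gamma} + \frac{\eta}{1-\gamma} \\
&= \left\| V_M^{*} - V_M^{\langle \mu, \Sigma \rangle} \right\|_\infty + \frac{(i+1)\eta}{1-\gamma} .
\end{aligned}$$
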
Thus, by the principle of mathematical induction, the statement is true for $i = 1, 2, \dots, m - 1$. After performing $m$ 
updates, $\bigcup_{i \in [m]} \mathcal{P}_i = S$. Thus, we obtain the $\gamma$-contraction over the entire state-space. 

□ 


A.2.1 Proof of Theorem 1 
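Proof (of Theorem 1). 

The loss of all policies is bounded by $\frac{1}{1-\gamma}$. Therefore, by applying Lemma 2 and recursing on (7) for 
$K \geq \log_\gamma(\varepsilon(1-\gamma))$ iterations, we obtain 

$$\begin{aligned}
\left\| V_M^{*} - V_M^{\langle \mu, \Sigma_K \rangle} \right\|_\infty
&\leq \gamma^{K} \left( \frac{1}{1-\gamma} \right) + \frac{m\eta}{(1-\gamma)^2} \\
&\leq \left( \varepsilon(1-\gamma) \right) \left( \frac{1}{1-\gamma} \right) + \frac{m\eta}{(1-\gamma)^2} \\
&= \frac{m\eta}{(1-\gamma)^2} + \varepsilon .
\end{aligned}$$

□ 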
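
The arithmetic behind the iteration count can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the initial loss of $1/(1-\gamma)$ is an assumption of this example. It computes the smallest integer $K$ with $\gamma^K \leq \varepsilon(1-\gamma)$ and unrolls the recursion of (7).

```python
import math


def iterations_needed(gamma: float, eps: float) -> int:
    """Smallest integer K such that gamma**K <= eps * (1 - gamma)."""
    target = eps * (1.0 - gamma)
    return max(0, math.ceil(math.log(target) / math.log(gamma)))


def loss_bound_after(num_iters: int, gamma: float, m: int, eta: float,
                     initial_loss: float) -> float:
    """Unroll loss <- gamma * loss + m * eta / (1 - gamma), num_iters times."""
    loss = initial_loss
    for _ in range(num_iters):
        loss = gamma * loss + m * eta / (1.0 - gamma)
    return loss


if __name__ == "__main__":
    gamma, eps, m, eta = 0.9, 0.1, 4, 0.01
    K = iterations_needed(gamma, eps)
    # Assumed initial loss: 1 / (1 - gamma).
    bound = loss_bound_after(K, gamma, m, eta, 1.0 / (1.0 - gamma))
    print(K, bound, m * eta / (1.0 - gamma) ** 2 + eps)
```

The unrolled value stays below $m\eta/(1-\gamma)^2 + \varepsilon$ because the geometric series of per-iteration errors sums to at most $m\eta/(1-\gamma)^2$.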


A.3 Experiments 


We performed experiments on three well-known RL benchmarks: Mountain Car (MC), Puddle World (PW) Sutton 
[1996] and the Pinball domain Konidaris and Barto [2009]. The MC domain is discussed here. The PW and Pinball 
domains are found in the main paper. The purpose of our experiments is to show that LSB can solve a complicated 
task with a simple policy representation by combining bootstrapped skills. These domains are simple enough that 
we can still solve them using richer representations. This allows us to compare LSB to a policy that is very close to 
optimal. Our experiments demonstrate potential to scale up to higher dimensional domains by combining skills over 
simple representations. 


Recall that LSB is a meta-algorithm. We must provide an algorithm for Policy Evaluation (PE) and skill learning. In 
our experiments, for the MC domain, we used SMDP-LSTD Sorg and Singh [2010] for PE and a modified version of 
Regular-Gradient Actor-Critic Bhatnagar et al. [2009] for skill learning. 


In our experiments, for the MC domain, each skill is simply represented as a probability distribution over actions 
(independent of the state). We compare the performance to a policy using the same representation that has been 
derived using the monolithic approach. Each experiment is run for 10 independent trials. A 2 x 2 grid partitioning is 
used for the skill partition in this domain, unless otherwise stated. Binary-grid features are used to estimate the value 
function. 
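
The grid partition and binary-grid features can be sketched as follows. This is a minimal illustration rather than the code used in the experiments: the function names are hypothetical, and the Mountain Car bounds used in the example (position in [-1.2, 0.6], velocity in [-0.07, 0.07]) are the commonly used ones, assumed here because they are not stated in the text.

```python
import numpy as np


def grid_cell(state, low, high, bins):
    """Index of the grid cell (partition class) that contains `state`."""
    state, low, high = (np.asarray(x, dtype=float) for x in (state, low, high))
    frac = (state - low) / (high - low)              # rescale each dimension to [0, 1]
    idx = np.clip((frac * bins).astype(int), 0, bins - 1)
    return int(np.ravel_multi_index(tuple(idx), (bins,) * len(idx)))


def binary_grid_features(state, low, high, bins):
    """One-hot feature vector over the cells of a bins x ... x bins grid."""
    phi = np.zeros(bins ** len(low))
    phi[grid_cell(state, low, high, bins)] = 1.0
    return phi


if __name__ == "__main__":
    low, high = [-1.2, -0.07], [0.6, 0.07]           # assumed Mountain Car bounds
    print(grid_cell([0.5, -0.05], low, high, bins=2))  # class in a 2 x 2 partition
    print(binary_grid_features([0.5, -0.05], low, high, bins=4))
```

The same cell index can serve two roles: with a coarse grid it selects the active skill, and with a finer grid it selects the active value-function feature.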


These are example representations. In principle, any value function and policy representation that is representative of 
the domain can be utilized. 


A.3.1 Mountain Car 

The Mountain Car domain consists of an under-powered car situated in a valley. The car has to leverage potential 
energy to propel itself up to the goal, which is the top of the rightmost hill. The state-space is the car’s position and 
velocity (p,v). 

Figure 6a compares the monolithic approach with LSB (for a 2 x 2 grid partition). The monolithic approach achieves 
low average reward. However, with the same restricted policy representation, LSB combines a set of skills, resulting 
in a richer solution space and a higher average reward, as seen in the figure. This is comparable to the approximately 
optimal average reward. Convergence is achieved after a single iteration since LSB is initialized from the partition 
containing the goal location, causing value to be propagated immediately to subsequent skills. 

Figure 6b compares the performance of different partitions, where a 1 x 1 grid represents the monolithic approach. As 
seen in the figure, the cost is lower for all partitions greater than 1 x 1, which is consistent with the results in the main 
paper. Figure 7 indicates the resulting value functions for various grid sizes. The value function from the monolithic 
approach is not capable of solving the task, whereas the 4 x 4 grid partition's value function is near-optimal. 

Figure 6: The Mountain Car domain: (a) The average reward for the LSB algorithm generated by the LSB skill policy. 
This is compared to the monolithic approach that attempts to solve the global task as well as an approximately optimal 
policy derived using Q-learning. (b) The average cost (negative reward) for different partitions (i.e., grid sizes). 

Figure 7: The Mountain Car Domain: Comparison of value functions for the monolithic approach (1 x 1 grid), the 
best partition using LSB (4 x 4 grid), and an approximately optimal value function (derived using Q-learning, applied 
for a huge number of iterations, with a fine discretization of each task's state-space). 


A.3.2 Puddle World 

Figure 8 compares the value functions for different grid sizes in Puddle World. The monolithic approach (1 x 1 
partition) provides a highly sub-optimal solution since, according to its value function, the agent must travel directly 
through the puddles to reach the goal location. The 3 x 3 grid provides a near-optimal solution. 

Figure 8: The Puddle World Domain: Comparison of value functions for the monolithic approach (1 x 1 grid), the best 
partition using LSB (3 x 3 grid), and an approximately optimal value function (derived using Q-learning, applied for 
a huge number of iterations, with a fine discretization of each task's state-space). 


A.4 Modified Regular-Gradient Actor-Critic 


We used a very simple policy gradient algorithm (Algorithm 2) for skill learning. The algorithm is based on Regular-Gradient 
Actor-Critic Bhatnagar et al. [2009]. The algorithm differs from Regular-Gradient Actor-Critic because it uses 
different representations for approximating the value function and the policy. For a state-action pair $(s, a) \in S \times A$, the 
functions $\phi(s, a) \in \mathbb{R}^{d}$ and $\zeta(s, a) \in \mathbb{R}^{d'}$ map $(s, a)$ to vectors with dimension $d$ and $d'$, respectively. We use the 
representation given by $\phi$ to approximate the value function and the representation given by $\zeta$ to represent the policy. 

The parametrized policy was defined by 

$$\pi_\theta(a \mid s) = \frac{\exp\left(\theta^{T} \zeta(s, a)\right)}{\sum_{a' \in A} \exp\left(\theta^{T} \zeta(s, a')\right)} , \tag{8}$$

where $\theta \in \mathbb{R}^{d'}$ are the learned policy parameters. We used representations such that $d' \ll d$, meaning that the policy 
parametrization was much simpler than the representation used to approximate the value function. This allowed us to 
get an accurate representation of the value function, but restrict the policy space to very simple policies. 
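
The softmax policy of equation (8) is straightforward to evaluate numerically. The following is a minimal sketch; the function name `softmax_policy` and the convention of stacking the policy features of all actions into one array are assumptions made for illustration.

```python
import numpy as np


def softmax_policy(theta, zeta_s):
    """Action probabilities pi_theta(. | s) for one state.

    theta:  (d',) policy parameters.
    zeta_s: (num_actions, d') array whose row a' holds the policy
            features zeta(s, a') of action a' in the current state.
    """
    logits = np.asarray(zeta_s, dtype=float) @ np.asarray(theta, dtype=float)
    logits = logits - logits.max()   # shift for numerical stability; result is unchanged
    weights = np.exp(logits)
    return weights / weights.sum()
```

With state-independent one-hot action features, this reduces to a single probability distribution over actions, which is the skill representation used in the Mountain Car experiments.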


Algorithm 2 Modified Regular-Gradient Actor-Critic 

Require: 

1. $\phi$ : mapping from state-action pairs to a vector representation used to approximate the value, 

2. $\omega$ : value function approximation parameters, 

3. $\zeta$ : mapping from state-action pairs to a vector representation used to approximate the policy, 

4. $\theta$ : policy parameters, 

5. $\alpha$ : the value learning rate (fast learning rate), 

6. $\beta$ : the policy learning rate (slow learning rate, i.e., $\beta < \alpha$), and 

7. $(s, a, s', r)$ : a state-action-next-state-reward tuple. 

1: $V_{new} \leftarrow r + \gamma \sum_{a' \in A} \pi_\theta(a' \mid s')\, \omega^{T} \phi(s', a')$ {Estimate $V^{\pi_\theta}$ given the new sample.} 

2: $V_{old} \leftarrow \omega^{T} \phi(s, a)$ {Use the current value function approximation to estimate the value.} 

3: $\delta \leftarrow V_{new} - V_{old}$ {Compute the temporal difference error.} 

4: $\omega' \leftarrow \omega + \alpha \delta \phi(s, a)$ {Update the value function weights using the fast learning rate $\alpha$.} 

5: $\psi_{s,a} \leftarrow \zeta(s, a) - \sum_{a' \in A} \pi_\theta(a' \mid s)\, \zeta(s, a')$ {Compute the "compatible features" Bhatnagar et al. [2009].} 

6: $\theta' \leftarrow \theta + \beta \delta \psi_{s,a}$ {Update the policy parameters using the slow learning rate $\beta$.} 

7: Return $(\omega', \theta')$ {Updated value function and policy parameters.} 



Although using different representations for approximating the value function and the policy strictly violates the policy 
gradient theorem Sutton et al. [2000], it still tends to work well in practice. 

In our experiments, we used a fast learning rate of $\alpha = 0.1$ and a slow learning rate of $\beta = 0.2\alpha$. Value function and 
policy parameters were initialized to zero vectors. 
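
A single update of the modified actor-critic can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: the function names, the argument order and the callable feature maps `phi` and `zeta` are assumptions, and the defaults use the learning rates quoted above ($\alpha = 0.1$, $\beta = 0.2\alpha = 0.02$).

```python
import numpy as np


def policy_probs(theta, zeta, s, actions):
    """Softmax policy over `actions` in state s, using policy features zeta."""
    logits = np.array([theta @ zeta(s, b) for b in actions])
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()


def actor_critic_step(omega, theta, phi, zeta, sample, actions, gamma,
                      alpha=0.1, beta=0.02):
    """One update from a (s, a, s_next, r) sample; returns (omega', theta')."""
    s, a, s_next, r = sample
    # Bootstrapped estimate of the value given the new sample.
    pi_next = policy_probs(theta, zeta, s_next, actions)
    v_new = r + gamma * sum(p * (omega @ phi(s_next, b))
                            for p, b in zip(pi_next, actions))
    v_old = omega @ phi(s, a)
    delta = v_new - v_old                              # temporal difference error
    omega_new = omega + alpha * delta * phi(s, a)      # fast (critic) update
    # Compatible features: zeta(s, a) minus its expectation under the policy.
    pi_s = policy_probs(theta, zeta, s, actions)
    psi = zeta(s, a) - sum(p * zeta(s, b) for p, b in zip(pi_s, actions))
    theta_new = theta + beta * delta * psi             # slow (actor) update
    return omega_new, theta_new
```

Keeping `phi` and `zeta` as separate callables mirrors the choice of a rich value representation alongside a deliberately simple policy representation.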


A.5 Pinball Demonstration Videos 


There are two videos attached showing a demonstration of a policy learned by LSB for the Pinball domain Konidaris 
and Barto [2009]. Both of these domains are analyzed and discussed in the main paper. The first video shows a policy 
learned for Maze-world and the second video shows a policy learned for Pinball-world, one of the standard pinball 
benchmark domains. The objective of the agent (blue ball) is to circumnavigate the obstacles and reach the goal region 
(red ball). A colored square is superimposed onto the active skill indicating the current skill or partition class being 
executed. 